Generalization properties of finite size polynomial Support Vector Machines 
The learning properties of finite size polynomial Support Vector Machines are analyzed in the case
of realizable classification tasks. The normalization of the high order features acts as a squeezing
factor, introducing a strong anisotropy in the pattern distribution in feature space. As a function
of the training set size, the corresponding generalization error presents a crossover, more or less
abrupt depending on the distribution's anisotropy and on the task to be learned, between a fast-decreasing
and a slowly decreasing regime. This behaviour corresponds to the stepwise decrease
found by Dietrich et al. [1] in the thermodynamic limit. The theoretical results are in excellent
agreement with the numerical simulations.

I. INTRODUCTION



In the last decade, the typical properties of neural networks that learn classification tasks from a set of examples have 
been analyzed using the approach of Statistical Mechanics. In the general setting, the value of a binary output neuron 
represents whether the input vector, describing a particular pattern, belongs or not to the class to be recognized. 
Manuscript character recognition and medical diagnosis are examples of such classification problems. The process 
of inferring the rule underlying the input-output mapping given a set of examples is called learning. The aim is to 
predict correctly the class of novel data, i.e. to generalize. 

In the simplest neural network, the perceptron, the inputs are directly connected to a single output neuron. The 
output state is given by the sign of the weighted sum of the inputs. Learning then amounts to determining the weights
of the connections in order to obtain the correct outputs for the training examples. Considering the weights as the
components of a vector, the network classifies the input vectors according to whether their projections onto the weight
vector are positive or negative. Thus, patterns of different classes are separated by the hyperplane orthogonal to the
weight vector. Beyond these linear separations, two different learning schemes have been suggested. Either the input
vectors are mapped by linear hidden units to so-called internal representations that must be linearly separable by the
output neuron, or a more powerful output unit is defined, able to perform more complicated functions than just the 
weighted sum of its inputs. 

The first solution is implemented using feedforward layered neural networks. The classification of the internal 
representations, performed by the output neuron, corresponds in general to a complicated separation surface in input 
space. However, the relation between the number of hidden units of a network and the class of rules it can infer is 
still an open problem. In practice, the number of hidden neurons is either guessed or determined through constructive
heuristics. 

A solution that uses a more complex output unit, the Support Vector Machine (SVM) [?], has been recently
proposed. The input patterns are transformed into high dimensional feature vectors whose components may include the 
original input together with specific functions of its coordinates selected a priori, with the aim that the learning set be 
linearly separable in feature space. In that case the learning problem is reduced to that of training a simple perceptron. 
For example, if the feature space includes all the pairwise products of the input vector, the SVM may implement any 
classification rule corresponding to a quadratic separating surface in input space. Higher order polynomial SVMs and 
other types of SVMs may be defined by introducing the corresponding features. A big advantage is that learning a 
linearly separable rule is a convex optimization problem. The difficulties of having many local minima, that hinder the 
process of training multilayered neural networks, are thus circumvented. Once the adequate feature space is defined, 
the SVM selects the particular hyperplane called Maximal Margin (or Maximal Stability) Hyperplane (MMH), which 
lies at the largest distance to its closest patterns in the training set. These patterns are called Support Vectors (SV). 
The MMH solution has interesting properties. In particular, the fraction of learning patterns that belong to the
SVs provides an upper bound [?] to the generalization error, that is, to the probability of incorrectly classifying a new
input. It has been shown [3] that the perceptron weights are a linear combination of the SVs, an interesting property
in high dimensional feature spaces, as their number is bounded.

A perceptron can learn with very high probability any set of examples, regardless of the underlying classification
rule, provided that their number does not exceed twice its input space dimension [?]. However, this simple rote
learning does not capture the rule underlying the classification. As the feature space dimension of the SVM
may be comparable to, or even larger than, the number of available training patterns, we would expect
SVMs to have a poor generalization performance. Surprisingly, this seems not to be the case in applications [?].

Two theoretical papers have recently addressed this interesting question. They determined the typical
properties of a family of polynomial SVMs in the limit of large dimensional spaces, reaching completely different
results in spite of the seemingly innocuous differences between the models. Both papers consider polynomial
SVMs in which the input vectors $\mathbf{x} \in \mathbb{R}^N$ are mapped onto quadratic features. More precisely, the normalized
mapping $\Phi_n(\mathbf{x}) = (\mathbf{x}, x_1\mathbf{x}/\sqrt{N}, x_2\mathbf{x}/\sqrt{N}, \ldots, x_N\mathbf{x}/\sqrt{N})$ has been considered in [6]. The non-normalized mapping
$\Phi_{nn}(\mathbf{x}) = (\mathbf{x}, x_1\mathbf{x}, x_2\mathbf{x}, \ldots, x_k\mathbf{x})$ has been studied in [?] as a function of $k$, the number of quadratic features. For
$k = N$ the dimension of both feature spaces is the same, corresponding to a linear subspace of dimension $N$ and
a quadratic subspace of dimension $N^2$. The mappings only differ in the distributions of the quadratic components
in feature space. Due to the normalization, those of $\Phi_n$ are squeezed by a normalizing factor $a = 1/\sqrt{N}$ with respect
to those of $\Phi_{nn}$. In the case of learning a linearly separable rule with the non-normalized mapping $\Phi_{nn}$, the
generalization error at any given learning set size increases dramatically with the number $k$ of quadratic features
included [?]. On the contrary, in the case of mapping $\Phi_n$, the generalization error exhibits an interesting stepwise
decrease, also found within the Gibbs learning paradigm in a quadratic feature space [?]. If the number of training
patterns scales with $N$, the dimension of the linear subspace, it decreases up to an asymptotic lower bound. If the
number of examples scales proportionally to $N^2$, it vanishes asymptotically. In particular, if the rule to be inferred is
linearly separable in the input space, learning in the feature space with the mapping $\Phi_n$ is harmless, as the decrease
of the generalization error with the number of training patterns presents a slight slow-down with respect to that of a
simple perceptron learning in input space.
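
As a concrete illustration of the two mappings, the following Python sketch builds the feature vector $(\mathbf{x}, a x_1 \mathbf{x}, \ldots, a x_k \mathbf{x})$ from an input vector. The function name `quadratic_features` and its arguments are illustrative choices, not taken from the cited papers; only the normalizing factors $a = 1$ and $a = 1/\sqrt{N}$ come from the text.

```python
import math

def quadratic_features(x, k, normalized):
    """Return (x, a*x_1*x, ..., a*x_k*x) as a flat list.

    a = 1/sqrt(N) for the normalized mapping, a = 1 for the non-normalized one.
    """
    n = len(x)
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and N")
    a = 1.0 / math.sqrt(n) if normalized else 1.0
    features = list(x)                       # linear subspace, dimension N
    for i in range(k):                       # k blocks of N quadratic features
        features.extend(a * x[i] * xj for xj in x)
    return features

x = [1.0, -2.0, 0.5, 3.0]
phi_nn = quadratic_features(x, k=4, normalized=False)
phi_n = quadratic_features(x, k=4, normalized=True)
```

With $k = N$ both feature vectors have $N + N^2$ components and differ only by the factor $a$ on the quadratic part.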

As this stepwise learning is exclusively related to the fact that the normalizing factor of the quadratic features
vanishes in the thermodynamic limit $N \to \infty$, in the present paper we determine the influence of the normalizing
factor on the typical generalization performance of finite size SVMs. To this end, we introduce two parameters, $\sigma$ and
$\Delta$, characterizing the mapping of the $N$-dimensional input patterns onto the feature space. The variance $\sigma^2$ reflects the
width of the high-order features distribution and is related to the normalizing factor $a$. The inflation factor $\Delta$ accounts
for the proportion of quadratic features with respect to the input space dimension $N$. Actual quadratic SVMs are
characterized by different values of $\Delta$ and $\sigma$, depending on $N$ and $a$. Keeping $\sigma$ and $\Delta$ fixed in the thermodynamic
limit allows us to determine the typical properties of actual SVMs, which have finite compressing factors and inflation
ratios.

In fact, the behaviour of the SVMs is the same as that of a simple perceptron learning a training set with patterns 
drawn from a highly anisotropic probability distribution, such that a macroscopic fraction of components have a 
different variance from the others. Not surprisingly, we find that the asymptotic behaviour corresponding to both the 
small and large training set size limits, is the same as the one of the perceptron's MMH. Only the prefactors depend 
on the mapping used by the SVM. 

As expected, the stepwise learning obtained with the normalized mapping in the thermodynamic limit becomes a 
crossover. Upon increasing the number of training patterns, the generalization error first presents an abrupt decrease,
that corresponds to learning the weight components in the linear subspace, followed by a slower decrease corresponding
to the learning of the quadratic components. The steepness of the crossover not only depends on $\Delta$ and $\sigma$, but also
on the task to be learned. The agreement between our analytic results and numerical simulations is excellent. 

The paper is organized as follows: in section II we introduce the model and the main steps of the Statistical
Mechanics calculation. Numerical simulation results are compared to the corresponding theoretical predictions in
section III. The two regimes of the generalization error and the asymptotic behaviours are discussed in section IV.
The conclusion is left to section V.



II. THE MODEL 



We consider the problem of learning a binary classification task from examples with a SVM in polynomial feature
spaces. The learning set contains $M$ patterns $(\mathbf{x}^\mu, \tau^\mu)$ ($\mu = 1, \ldots, M$) where $\mathbf{x}^\mu$ is an input vector in the $N$-dimensional
input space, and $\tau^\mu \in \{-1, 1\}$ is its class. We assume that the components $x_i^\mu$ ($i = 1, \ldots, N$) are
independent identically distributed (i.i.d.) random variables drawn from gaussian distributions having zero-mean and
unit variance:

$$P(\mathbf{x}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x_i^2}{2}\right). \tag{1}$$

In the following we concentrate on quadratic feature spaces, although our conclusions are more general, and may be
applied to higher order polynomial SVMs, as discussed in section IV. The mappings $\Phi_{nn}(\mathbf{x}) = (\mathbf{x}, x_1\mathbf{x}, x_2\mathbf{x}, \ldots, x_N\mathbf{x})$
and $\Phi_n(\mathbf{x}) = (\mathbf{x}, x_1\mathbf{x}/\sqrt{N}, x_2\mathbf{x}/\sqrt{N}, \ldots, x_N\mathbf{x}/\sqrt{N})$ are particular instances of mappings of the form
$\Phi(\mathbf{x}) = (\phi_1, \phi_2, \ldots, \phi_N, \phi_{11}, \phi_{12}, \ldots, \phi_{NN})$ where $\phi_i = x_i$ and $\phi_{ij} = a\, x_i x_j$, where $a$ is the normalizing factor of the quadratic
components: $a = 1$ for mapping $\Phi_{nn}$ and $a = 1/\sqrt{N}$ for $\Phi_n$. The pattern probability distribution in feature space
is:

$$P(\Phi) = \int \prod_{i=1}^{N} \frac{dx_i}{\sqrt{2\pi}} \exp\left(-\frac{x_i^2}{2}\right) \delta(\phi_i - x_i) \prod_{j=1}^{N} \delta(\phi_{ij} - a\, x_i x_j). \tag{2}$$

Clearly, the components of $\Phi$ are not independent random variables. For example, a number $O(N^3)$ of triplets of the
form $\phi_{ij}\phi_{jk}\phi_{ki}$ have positive correlations. These contribute to the third order moments, which should vanish if the
features were gaussian. Moreover, the fourth order connected correlations [?] do not vanish in the thermodynamic
limit. Nevertheless, in the following we will neglect these and higher order connected moments. This approximation,
used in [?] and implicit in [?], is equivalent to assuming that all the components in feature space are independent
gaussian variables. Then, the only difference between the mappings $\Phi_n$ and $\Phi_{nn}$ lies in the variance of the quadratic
components distribution. The results obtained using this simplification are in excellent agreement with the numerical
tests described in the next section.

Since, due to the symmetry of the transformation, only $N(N+1)/2$ among the $N^2$ quadratic features are different,
hereafter we restrict the feature space and only consider the non redundant components, that we denote $\boldsymbol{\xi} = (\boldsymbol{\xi}_u, \boldsymbol{\xi}_\sigma)$.
Its first $N$ components $\boldsymbol{\xi}_u = (\xi_1, \ldots, \xi_N)$, hereafter called u-components, represent the input pattern of unit variance,
lying in the linear subspace. The remaining components $\boldsymbol{\xi}_\sigma = (\xi_{N+1}, \ldots, \xi_{\bar N})$ stand for the non redundant quadratic
features, of variance $\sigma^2$, hereafter called $\sigma$-components. $\bar N = N(1 + \Delta)$ is the dimension of the restricted feature space,
where the inflation ratio $\Delta$ is the relative number of non-redundant quadratic features per input space dimension.
The quadratic mapping has $\Delta = (N+1)/2$.

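According to the preceding discussion, we assume that learning $N$-dimensional patterns selected with the isotropic
distribution (1) with a quadratic SVM is equivalent to learning the MMH with a simple perceptron in an $\bar N$-dimensional
space where the patterns are drawn using the following anisotropic distribution,

$$P(\boldsymbol{\xi}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\xi_i^2}{2}\right) \prod_{j=N+1}^{\bar N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\xi_j^2}{2\sigma^2}\right). \tag{3}$$

The second moment of the u-features is $\langle \boldsymbol{\xi}_u^2 \rangle = N$ and that of the $\sigma$-features is $\langle \boldsymbol{\xi}_\sigma^2 \rangle = N\Delta\sigma^2$. If $\sigma^2\Delta = 1$, we get
$\langle \boldsymbol{\xi}_\sigma^2 \rangle = \langle \boldsymbol{\xi}_u^2 \rangle$, which is the relation satisfied by the normalized mapping considered in [6]. The non-normalized mapping
corresponds to $\sigma^2\Delta = N$. In the following, instead of selecting either of these possibilities a priori, we consider $\Delta$ and
$\sigma$ as independent parameters, that are kept constant when taking the thermodynamic limit.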
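
The second moments quoted above can be checked by direct sampling from the anisotropic distribution. The sketch below is illustrative only: the function names, the sample size, the seed and the toy dimension $N = 20$ are assumptions made for this example.

```python
import random

def sample_pattern(n, n_quadratic, sigma, rng):
    """Draw one pattern: n components of unit variance, then
    n_quadratic components of variance sigma**2 (independent Gaussians)."""
    xi_u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    xi_sigma = [rng.gauss(0.0, sigma) for _ in range(n_quadratic)]
    return xi_u, xi_sigma

def second_moments(n, delta, sigma, samples=2000, seed=0):
    """Monte Carlo estimate of the squared norms of the u- and sigma-parts."""
    rng = random.Random(seed)
    n_quadratic = round(n * delta)
    sum_u = sum_sigma = 0.0
    for _ in range(samples):
        xi_u, xi_sigma = sample_pattern(n, n_quadratic, sigma, rng)
        sum_u += sum(v * v for v in xi_u)
        sum_sigma += sum(v * v for v in xi_sigma)
    return sum_u / samples, sum_sigma / samples

n, delta = 20, 10.5                  # delta = (n + 1) / 2, as for the quadratic mapping
sigma_n = (1.0 / delta) ** 0.5       # normalized case: sigma^2 * delta = 1
m_u, m_sigma = second_moments(n, delta, sigma_n)
```

For the normalized case both estimates should be close to $N$, since $N\Delta\sigma^2 = N$ when $\sigma^2\Delta = 1$.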

Since the rules to be inferred are assumed to be linear separations in feature space, we represent them by the
weights $\mathbf{w}^* = (w_1^*, \ldots, w_{\bar N}^*)$ of a teacher perceptron, so that the class of the patterns is $\tau = \mathrm{sign}(\boldsymbol{\xi} \cdot \mathbf{w}^*)$. Without
any loss of generality we consider normalized teachers: $\mathbf{w}^* \cdot \mathbf{w}^* = \bar N$. The training set in feature space is then
$\mathcal{L}_M = \{(\boldsymbol{\xi}^\mu, \tau^\mu)\}_{\mu = 1, \ldots, M}$.

In the following we study the typical properties of polynomial SVMs learning realizable classification tasks, using
the tools of Statistical Mechanics. If $\mathbf{w} = (w_1, \ldots, w_{\bar N})$ is the student perceptron weight vector, $\gamma^\mu = \tau^\mu \boldsymbol{\xi}^\mu \cdot \mathbf{w}/\sqrt{\mathbf{w} \cdot \mathbf{w}}$
is the stability of pattern $\mu$ in feature space. The pertinent cost function is:

$$E(\mathbf{w}, \kappa; \mathcal{L}_M) = \sum_{\mu=1}^{M} \Theta(\kappa - \gamma^\mu). \tag{4}$$

$\kappa$, the smallest allowed distance between the hyperplane and the training patterns, is called the margin. The MMH
corresponds to the weights with vanishing cost (4) that maximize $\kappa$.
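
A minimal sketch of the stabilities and of the cost, counting the patterns whose stability falls below the margin, may help fix the definitions. The function names and the toy data are assumptions for illustration; the step function is taken to vanish at zero argument, a convention the text does not specify.

```python
import math

def stabilities(w, patterns, labels):
    """gamma = tau * (xi . w) / sqrt(w . w) for every training pattern."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [tau * sum(wi * xk for wi, xk in zip(w, pattern)) / norm
            for pattern, tau in zip(patterns, labels)]

def cost(w, kappa, patterns, labels):
    """Number of patterns whose stability is strictly below the margin kappa."""
    return sum(1 for gamma in stabilities(w, patterns, labels) if gamma < kappa)

patterns = [[2.0, 0.0], [-1.0, 3.0], [0.5, 1.0]]
labels = [1, -1, 1]
w = [1.0, 0.0]
gammas = stabilities(w, patterns, labels)   # stabilities 2.0, 1.0 and 0.5
```

For this toy set the cost vanishes for any margin up to 0.5, the smallest stability, which is therefore the largest margin this particular weight vector can support.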

The typical properties of cost (4) in the case of isotropic pattern distributions have been exhaustively studied [?], [12].
The case of a single anisotropy axis has also been investigated [?]. Here we study the case of the anisotropic
distribution (3), where a macroscopic fraction of components have different variance from the others, which is pertinent
for understanding the properties of the SVM.

Considering the cost (4) as an energy, the partition function at temperature $1/\beta$ writes

$$Z(\kappa, \beta; \mathcal{L}_M) = \int \exp[-\beta E(\mathbf{w}, \kappa; \mathcal{L}_M)]\, p(\mathbf{w})\, d\mathbf{w}. \tag{5}$$

Without any loss of generality, we assume that the a priori distribution of the student weights is uniform over the
hypersphere of radius $\bar N^{1/2}$, i.e. $p(\mathbf{w}) = \delta(\mathbf{w} \cdot \mathbf{w} - \bar N)$, meaning that the student weights are normalized in feature
space. In the limit $\beta \to \infty$, the corresponding free energy $f(\kappa, \beta; \mathcal{L}_M) = -(1/\beta N) \ln Z(\kappa, \beta; \mathcal{L}_M)$ is dominated by the
weights that minimize the cost (4).

The typical properties of the MMH are obtained by looking for the largest value of $\kappa$ for which the quenched average
of the free energy over the patterns distribution, in the zero temperature limit $\beta \to \infty$, vanishes. This average is
calculated by the replica method, using the identity

$$f(\kappa, \beta) = -\frac{1}{\beta N}\, \overline{\ln Z(\kappa, \beta; \mathcal{L}_M)} = -\frac{1}{\beta N} \lim_{n \to 0} \frac{\ln \overline{Z^n(\kappa, \beta; \mathcal{L}_M)}}{n}, \tag{6}$$

where the overline represents the average over $\mathcal{L}_M$, composed of patterns selected according to (3).

We obtain the typical properties of the MMH corresponding to given values of $\Delta$ and $\sigma$ by taking the thermodynamic
limit $N \to \infty$, $M \to \infty$, with $\alpha = M/N$, $\Delta$ and $\sigma$ constant. Notice that the relation between the number of training
examples and the feature space dimension, $\bar\alpha = M/\bar N = \alpha/(1 + \Delta)$, is finite. Thus, not only are we able to study the
dependence of the learning properties as a function of the training set size as usual, but also of the inflation factor
that characterizes the SVM, as well as of the variance of the quadratic components. As we only consider realizable
rules, i.e. classification tasks that are linearly separable in feature space, the energy (4) is a convex function of the
weights $\mathbf{w}$, and replica symmetry holds.

For any $\kappa < \kappa_{max}$, there are a macroscopic number of weights that minimize the cost function (4). In particular,
in the case of $\kappa = 0$, the cost is the number of training errors, and is minimized by any weight vector that
classifies correctly the training set. The typical properties of such solution, called Gibbs learning, may be expressed
in terms of several order parameters [?]. Among them, $q_u^{ab} = \sum_{i=1}^{N} \langle w_i^a w_i^b \rangle/\bar N$, $q_\sigma^{ab} = \sum_{i=N+1}^{\bar N} \langle w_i^a w_i^b \rangle/\bar N$ and
$Q^a = \sum_{i=N+1}^{\bar N} \langle w_i^a w_i^a \rangle/\bar N$, where $a \neq b$ are replica indices and $\langle \cdots \rangle$ stands for the usual thermodynamic average (with
Boltzmann factor corresponding to the partition function (5)). $q_u^{ab}$ and $q_\sigma^{ab}$ represent the overlaps between different
solutions in the u- and the $\sigma$-subspaces respectively. $\bar N Q^a$ is the typical squared norm of the $\sigma$-components of replica $a$.
Because of replica symmetry we have $Q^a = Q^b = Q$, $q_\sigma^{ab} = q_\sigma$ and $q_u^{ab} = q_u$ for all $a$, $b$. Upon increasing $\kappa$, the volume
of the error-free solutions in weight space shrinks, and vanishes when $\kappa$ is maximized. Correspondingly, $q_u \to 1 - Q$
and $q_\sigma \to Q$, with $x = \lim_{\kappa \to \kappa_{max}} (1 - q_u/(1 - Q))/(1 - q_\sigma/Q)$ finite. In the limit $\kappa \to \kappa_{max}$, the properties of the
MMH may be expressed in terms of $x$, $\kappa_{max}$ and the following order parameters,

$$Q = \frac{1}{\bar N} \sum_{i=N+1}^{\bar N} \langle w_i^2 \rangle, \tag{7}$$

$$R_u = \frac{1}{\bar N \sqrt{(1 - Q)(1 - Q^*)}} \sum_{i=1}^{N} \langle w_i w_i^* \rangle, \tag{8}$$

$$R_\sigma = \frac{1}{\bar N \sqrt{Q\, Q^*}} \sum_{i=N+1}^{\bar N} \langle w_i w_i^* \rangle, \tag{9}$$

where $Q^* = \sum_{i=N+1}^{\bar N} (w_i^*)^2/\bar N$ is the teacher's squared weight vector in the $\sigma$-subspace. $Q$ is the corresponding typical
value for the student. $R_u$ and $R_\sigma$ are proportional to the overlaps between the student and the teacher weights in the
u- and the $\sigma$-subspaces respectively. The factors in the denominators arise because the weights are not normalized
in each subspace.

The saddle point equations corresponding to the extremum of the free energy for the MMH are

[Equations (10)-(12): not recoverable from the source.]

$$\frac{R_\sigma^2}{1 - R_\sigma^2} = \frac{\Delta^*}{\Delta}\, \frac{R_u^2}{1 - R_u^2}, \tag{13}$$
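
where $\Delta_\sigma = \sigma^2 Q/(1 - Q)$ and $\Delta^* = \sigma^2 Q^*/(1 - Q^*)$. The integrals in the left hand side of equations (10)-(12),
given in equations (14)-(17), include

$$\int_{-\tilde\kappa}^{\infty} Dt\, (t + \tilde\kappa)^2\, H\!\left(\frac{tR}{\sqrt{1 - R^2}}\right) \quad \text{and} \quad \int_{-\tilde\kappa}^{\infty} Dt\, \tilde\kappa\, (t + \tilde\kappa)\, H\!\left(\frac{tR}{\sqrt{1 - R^2}}\right),$$

together with a term containing $\sqrt{1 - R^2}\, \exp(-\tilde\kappa^2/(2(1 - R^2)))/\sqrt{2\pi}$ whose full expression is not recoverable from the source,
with $Dt = dt\, \exp(-t^2/2)/\sqrt{2\pi}$, $H(x) = \int_x^\infty Dt$, and

$$\tilde\kappa = \frac{\kappa}{\sqrt{(1 - Q)(1 + \Delta_\sigma)}}, \tag{18}$$

$$R = \frac{R_u + \sqrt{\Delta_\sigma \Delta^*}\, R_\sigma}{\sqrt{(1 + \Delta_\sigma)(1 + \Delta^*)}}. \tag{19}$$

The value of $R$ determines the generalization error through $\epsilon_g = (1/\pi) \arccos(R)$.

After solving the above equations for $Q$, $R_u$, $R_\sigma$, $x$ and $\kappa$, it is straightforward to determine $\rho_{SV}$, the fraction of
training patterns that belong to the subset of SV [?]:

$$\rho_{SV} = 2 \int_{-\infty}^{\tilde\kappa} H\!\left(-\frac{tR}{\sqrt{1 - R^2}}\right) Dt. \tag{20}$$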
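
Once the order parameters are known, the generalization error follows from the combined overlap $R = (R_u + \sqrt{\Delta_\sigma \Delta^*} R_\sigma)/\sqrt{(1 + \Delta_\sigma)(1 + \Delta^*)}$. The sketch below evaluates that combination and $\epsilon_g = \arccos(R)/\pi$. The function and argument names are illustrative, and the order-parameter values passed in are arbitrary inputs, not solutions of the saddle point equations.

```python
import math

def overlap_R(Q, Q_star, R_u, R_sigma, sigma2):
    """Teacher-student overlap R built from the subspace overlaps.

    Requires Q < 1 and Q_star < 1 so that the ratios below are finite.
    """
    delta_sigma = sigma2 * Q / (1.0 - Q)
    delta_star = sigma2 * Q_star / (1.0 - Q_star)
    numerator = R_u + math.sqrt(delta_sigma * delta_star) * R_sigma
    denominator = math.sqrt((1.0 + delta_sigma) * (1.0 + delta_star))
    return numerator / denominator

def generalization_error(R):
    """eps_g = arccos(R) / pi, with R clamped against rounding errors."""
    R = max(-1.0, min(1.0, R))
    return math.acos(R) / math.pi

R_example = overlap_R(Q=0.5, Q_star=0.5, R_u=0.8, R_sigma=0.8, sigma2=1.0)
eps_example = generalization_error(R_example)
```

Clamping $R$ to $[-1, 1]$ guards against floating point values marginally above 1 when the student coincides with the teacher.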



In summary of this section, instead of considering a particular scaling of the fraction of high order features components
and their normalization with $N$, we analyzed the more general case where these quantities are kept as free
parameters. We determined the saddle point equations that define the typical properties of the corresponding SVM. 
This approach allows us to consider several learning scenarios, and more interestingly, to study the crossover between 
the different generalization regimes. 



III. RESULTS 



We describe first the experimental data, obtained with quadratic SVMs, using both mappings, $\Phi_{nn}$ and $\Phi_n$, which
have normalizing factors $a = 1$ and $a = 1/\sqrt{N}$ respectively, where $N$ is the input space dimension. The $M = \alpha N$
random input examples of each training set were selected with probability (1) and labelled by teachers of normalized
weights $\mathbf{w}^* = (\mathbf{w}_l^*, \mathbf{w}_q^*)$ drawn at random. $\mathbf{w}_l^*$ are the $N$ components in the linear subspace and $\mathbf{w}_q^*$ are the $N^2$
components in the quadratic subspace. Notice that, because of the symmetry of the mappings, teachers having the
same value of the symmetrized weights in the quadratic subspace, $(w_{ij}^* + w_{ji}^*)/2$, are all equivalent. The teachers
are characterized by the proportion of (squared) weight components in the quadratic subspace, $Q^* = \mathbf{w}_q^* \cdot \mathbf{w}_q^* / \mathbf{w}^* \cdot \mathbf{w}^*$.
In particular, $Q^* = 0$ and $Q^* = 1$ correspond to a purely linear and a purely quadratic teacher respectively.

The experimental student weights $\mathbf{w} = (\mathbf{w}_l, \mathbf{w}_q)$ were obtained by solving numerically the dual problem [?, 14], using
the Quadratic Optimizer for Pattern Recognition program [?], that we adapted to the case without threshold treated
in this paper. We determined $Q$, and the overlaps $R_l$ and $R_q$ in the linear and the quadratic subspaces, respectively.
For each value of $M$, averages were performed over a large enough number of different teachers and training sets to
get the precision shown in the figures.
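
For readers who wish to try this kind of experiment on a very small scale, the following sketch finds the maximal margin hyperplane without threshold by coordinate ascent on the dual problem. It is a minimal stand-in written for illustration, not the Quadratic Optimizer for Pattern Recognition program used here; the helper names (`quadratic_map`, `train_mmh`), the seed and the toy sizes are assumptions.

```python
import random

def quadratic_map(x, a):
    """Non-redundant quadratic feature map: x followed by a*x_i*x_j for i <= j."""
    n = len(x)
    features = list(x)
    for i in range(n):
        for j in range(i, n):
            features.append(a * x[i] * x[j])
    return features

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_mmh(patterns, labels, sweeps=2000):
    """Coordinate ascent on the hard-margin dual without threshold.

    Maximizes sum_i c_i - 0.5 * |sum_i c_i tau_i xi_i|^2 subject to c_i >= 0.
    Returns w = sum_i c_i tau_i xi_i and the coefficients c.
    """
    c = [0.0] * len(patterns)
    w = [0.0] * len(patterns[0])
    for _ in range(sweeps):
        for i, (xi, tau) in enumerate(zip(patterns, labels)):
            # Exact one-dimensional maximization, clipped at c_i = 0.
            step = (1.0 - tau * dot(w, xi)) / dot(xi, xi)
            new_c = max(0.0, c[i] + step)
            change = new_c - c[i]
            if change != 0.0:
                w = [wk + change * tau * xk for wk, xk in zip(w, xi)]
                c[i] = new_c
    return w, c

rng = random.Random(1)
n, m = 4, 6
a = n ** -0.5                                    # normalized mapping
inputs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
patterns = [quadratic_map(x, a) for x in inputs]
teacher = [rng.gauss(0.0, 1.0) for _ in range(len(patterns[0]))]
labels = [1 if dot(teacher, p) >= 0 else -1 for p in patterns]
w, c = train_mmh(patterns, labels)
margins = [tau * dot(w, p) for p, tau in zip(patterns, labels)]
support_fraction = sum(1 for ci in c if ci > 0.0) / m
```

At convergence the patterns with $c_i > 0$ are the support vectors, and the weight vector is by construction a linear combination of them.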

Experiments were carried out for $N = 50$. The corresponding feature space dimension is $N(N + 1) = 2550$. The
restricted feature space considered in our model is composed of the $N$ (linear) input components, which define the
u-subspace of the feature space, and the $N\Delta$ non redundant quadratic components of the $\sigma$-subspace. For the sake
of comparison with the theoretical results determined in the thermodynamic limit, we characterize the actual SVM
by its (finite size) inflation factor $\Delta = (N + 1)/2$, and the variance $\sigma^2$ of the components in the $\sigma$-subspace, related
to the normalizing factor $a$ of the new features through $\sigma^2 = N a^2/\Delta$. In our case, since $N = 50$, $\Delta = 25.5$ and
$\sigma^2 = 1.960784\, a^2$, that is $\sigma^2 = 1.960784$ for the non-normalized mapping and $\sigma^2 = 0.039216$ for the normalized one.
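
The numerical values above follow directly from $\Delta = (N + 1)/2$ and $\sigma^2 = N a^2/\Delta$. A short check, with an illustrative function name:

```python
def svm_parameters(n, a):
    """Inflation factor and quadratic-component variance of a quadratic SVM."""
    delta = (n + 1) / 2            # non-redundant quadratic features per input dimension
    sigma2 = n * a * a / delta     # variance of the quadratic components
    return delta, sigma2

delta, sigma2_nn = svm_parameters(50, 1.0)         # non-normalized mapping, a = 1
_, sigma2_n = svm_parameters(50, 50 ** -0.5)       # normalized mapping, a = 1/sqrt(N)
restricted_dimension = 50 * (1 + delta)            # N(1 + Delta) = 1325
```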

FIG. 1. Order parameters of SVMs for purely linear teacher rules, $Q^* = 0$. Symbols are experimental results for input space
dimension $N = 50$, corresponding to the two kinds of quadratic mappings, $\Phi_n$ with $a = 1/\sqrt{N}$ (full symbols) and $\Phi_{nn}$ with
normalizing factor $a = 1$ (open symbols) respectively. Error bars are smaller than the symbols. The lines are solutions of
the saddle point equations of section II, for $\Delta = (N + 1)/2$ and $\sigma^2 = N a^2/\Delta$ with $N = 50$, and $a$ corresponding to each mapping.

FIG. 3. Order parameters of SVMs for isotropic teacher rules, $Q^*_{iso} = \Delta/(1 + \Delta)$. Definitions are the same as in figure 1.

FIG. 4. Order parameters of SVMs for a general teacher rule, $Q^* = 0.5$. Definitions are the same as in figure 1.

The values of $Q$, the fraction of squared student weights in the $\sigma$-subspace, and the teacher-student overlaps $R_u$
and $R_\sigma$, normalized within the corresponding subspace, are represented on figures 1 to 4 as a function of $\alpha = M/N$,
using full and open symbols for the mappings $\Phi_n$ and $\Phi_{nn}$ respectively. Notice that the abscissas correspond to the
fraction of training patterns per input space dimension. Error bars are smaller than the symbols' size. The lines are
not fits, but the theoretical curves corresponding to the same classes of teachers as the experimental results. The
excellent agreement with the experimental data is striking. Thus, the high order correlations of the features, neglected
in the theoretical models, are indeed negligible.

Fig. 1 corresponds to a purely linear teacher ($Q^* = 0$), i.e. to a quadratic SVM learning a rule linearly separable in
input space. As in this case $R_\sigma = 0$, only $R_u$ and $Q$ are represented. In the case of a purely quadratic rule, $Q^* = 1$,
represented on fig. 2, $R_u = 0$. Notice that the corresponding overlaps, $R_u$ and $R_\sigma$, do not have a similar behaviour, as
the latter increases much more slowly than the former, irrespective of the mapping. This happens because, as the number
of quadratic components scales like $N\Delta$, a number of examples of the order of $N\Delta$ are needed to learn them. Indeed,
$R_u$ reaches a value close to 1 with $\alpha \sim O(1)$ while $R_\sigma$ needs $\alpha \sim O(\Delta)$ to reach similar values.

Fig. 3 shows the results corresponding to the isotropic teacher, having $Q^* = Q^*_{iso} = \Delta/(1 + \Delta)$. For $\Delta = 25.5$
we have $Q^*_{iso} = 0.962$. A particular case of such a teacher has all its weight components of equal absolute value, i.e.
$(w_i^*)^2 = 1/N$, and was studied in [?] and [?]. Finally, the results corresponding to a general rule, with $Q^* = 0.5$,
are shown in fig. 4. Notice that at fixed $\alpha$, $R_u$ decreases and $R_\sigma$ increases with $Q^*$ at a rate that depends on the
mapping. These quantities determine the student's generalization error through the combination (19). The fact that
they increase as a function of $\alpha$ with different speed is a signature of hierarchical learning.

The generalization error $\epsilon_g$ corresponding to the different rules is plotted against $\alpha$ on fig. 5, for both mappings.
At any fixed $\alpha$, the performance obtained with the normalized mapping is better the smaller the value of $Q^*$. The
non-normalized mapping shows the opposite trend: its performance for a purely linear teacher is extremely bad, but
it improves for increasing values of $Q^*$ and slightly surpasses that of the normalized mapping in the case of a purely
quadratic teacher. These results reflect the competition on learning the anisotropically distributed features. In the
case of the normalized mapping, the $\sigma$-components are compressed ($\sigma^2 = 0.039$) with respect to the u-components,
which have unit variance. This is advantageous whenever the linear components carry the most significant information,
which is the case for $Q^* < 1$. When $Q^* = 1$, the linear components only introduce noise that hinders the learning
process. As the number of linear components is much smaller than the number of quadratic ones, their pernicious
effect should be more conspicuous the smaller the value of $\Delta$. Conversely, the non-normalized mapping has $\sigma^2 = 1.96$,
meaning that the compressed components are those of the u-subspace. Therefore, this mapping is better when most
of the information is contained in the $\sigma$-subspace, which is the case for teachers with large $Q^*$ and, in particular, with
$Q^* = 1$.

Finally, for the sake of completeness, the fraction of support vectors $\rho_{SV} = M_{SV}/M$, where $M_{SV}$ is the number
of training patterns with maximal stability, is represented on figure 6. This fraction is an upper bound to the
generalization error. Notice that these curves present qualitatively the same trends as $\epsilon_g$. Interestingly, $\rho_{SV}$ is smaller
for the normalized mapping than for the non-normalized one for most of the rules. Since the student's weights can
be expressed as a linear combination of SVs [?], this result is of practical interest.

FIG. 6. Fraction of learning patterns that belong to the subset of Support Vectors.



IV. DISCUSSION 



In order to understand the results obtained in the previous section, we first analyze the relative behaviour of $R_u$
and $R_\sigma$, which can be deduced from equation (13). If $\Delta^* \ll \Delta$, which is the case for sufficiently small $Q^*$, we get
that $R_\sigma \ll R_u$. This means that the quadratic components are more difficult to learn than the linear ones. On the
other hand, if the teacher lies mainly in the quadratic subspace, $\Delta^* \gg \Delta$, and then $R_\sigma > R_u$. The crossover between
these different behaviours occurs at $\Delta^* = \Delta$, for which equation (13) gives $R_\sigma = R_u$. For $N = 50$, which is the case
in our simulations, this arises for $Q^*_n = 0.998$ or $Q^*_{nn} = 0.929$, depending on whether we use the normalized or the
non-normalized mapping. In the particular case of the isotropic teacher and the non-normalized mapping, $Q^*_{iso} > Q^*_{nn}$,
so that $R_\sigma > R_u$, as shown on figure 3. These considerations alone are not sufficient to understand the behaviour of
the generalization error, which depends on the weighted sum of $R_\sigma$ and $R_u$ (see equation (19)).
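
The crossover values of $Q^*$ quoted above can be recovered by solving $\Delta^* = \sigma^2 Q^*/(1 - Q^*) = \Delta$ for $Q^*$, which gives $Q^* = \Delta/(\Delta + \sigma^2)$. A short numerical check, with an illustrative function name:

```python
def crossover_q_star(delta, sigma2):
    """Value of Q* at which Delta* = sigma2 * Q* / (1 - Q*) equals Delta."""
    return delta / (delta + sigma2)

n = 50
delta = (n + 1) / 2
q_normalized = crossover_q_star(delta, sigma2=n * (1.0 / n) / delta)   # a^2 = 1/N
q_non_normalized = crossover_q_star(delta, sigma2=n * 1.0 / delta)     # a^2 = 1
q_isotropic = delta / (1 + delta)
```

The isotropic teacher, with $Q^*_{iso} = 0.962$, lies above the non-normalized crossover and below the normalized one.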

The behaviour at small $\alpha$ is useful to understand the onset of hierarchical learning. A close inspection of equations
(10)-(13) shows that in the limit $\alpha \to 0$, $x \sim \sigma^2$ and $Q \sim \Delta\sigma^2/(\Delta\sigma^2 + 1)$ to leading order in $\alpha$. This result may
be understood with the following simple argument: if there is only one training pattern, clearly it is a SV and the
student's weight vector is proportional to it. As a typical example has $N$ components of unit length in the u-subspace
and $N\Delta$ components of length $\sigma$ in the $\sigma$-subspace, we have $Q = N\Delta\sigma^2/(N\Delta\sigma^2 + N)$. With the normalized mapping,
$\lim_{\alpha \to 0} Q = 1/2$. In the case of the non normalized one $\lim_{\alpha \to 0} Q = (2\Delta - 1)/2\Delta$, which depends on the inflation
factor of the SVM. In this limit, we obtain:

[Equations (21)-(23): not recoverable from the source.]

Therefore, $R \sim \sqrt{\alpha}$, like for the simple perceptron MMH [12], but with a prefactor that depends on the mapping and
the teacher.
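
The single-pattern argument for the small-$\alpha$ value of $Q$ can be written out for the two mappings. The block below is a worked restatement that uses only $\sigma^2 = N a^2/\Delta$ and $\Delta = (N+1)/2$ from the text; the numerical value for $N = 50$ is added for illustration.

```latex
\begin{align*}
Q_{\alpha \to 0} &= \frac{N\Delta\sigma^2}{N\Delta\sigma^2 + N}
                  = \frac{\Delta\sigma^2}{\Delta\sigma^2 + 1},
  \qquad \Delta\sigma^2 = N a^2, \\
\text{normalized } (a = 1/\sqrt{N}): \quad
  & \Delta\sigma^2 = 1 \;\Rightarrow\; Q_{\alpha \to 0} = \tfrac{1}{2}, \\
\text{non-normalized } (a = 1): \quad
  & \Delta\sigma^2 = N = 2\Delta - 1 \;\Rightarrow\;
    Q_{\alpha \to 0} = \frac{2\Delta - 1}{2\Delta} = \frac{N}{N+1}
    \approx 0.980 \text{ for } N = 50.
\end{align*}
```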

In our model, we expect that hierarchical learning corresponds to a fast increase of $R$ at small $\alpha$, mainly dominated
by the contribution of $R_u$. As in the limit $\alpha \to 0$,

$$R \simeq \frac{R_u + R_\sigma \sqrt{\sigma^4 \Delta \Delta^*}}{\sqrt{1 + \sigma^4 \Delta}\, \sqrt{1 + \Delta^*}}, \tag{24}$$

we expect hierarchical learning if $\sigma^4 \Delta \ll 1$ and $\Delta^* < 1$. The first condition establishes a constraint on the mapping,
which is only satisfied by the normalized one. The second condition, that ensures that $R_\sigma < R_u$ holds, gives the
range of teachers for which this hierarchical generalization takes place. Under these conditions, $R$ grows fast and
the contribution of $R_\sigma$ is negligible because it is weighted by $\sqrt{\sigma^4 \Delta \Delta^*}$. The effect of hierarchical learning is more
important the smaller $\Delta^*$. The most dramatic effect arises for $Q^* = 0$, i.e. for a quadratic SVM learning a linearly
separable rule.

FIG. 7. Generalization error of a SVM corresponding to different thermodynamic limits. See the text for the definition of $\alpha$
in each regime.

On the other hand, if $\sigma^4 \Delta \gg 1$, which is the case for the non normalized mapping, both $R_u$ and $R_\sigma$ contribute
to $R$ with comparable weights. Notice that, if the normalized mapping is used, the condition $\Delta^* < 1$ implies that
$Q^* < Q^*_{iso} = \Delta/(1 + \Delta)$, where $Q^*_{iso}$ corresponds to the isotropic teacher. A straightforward calculation shows that
a fraction of 47.5% of teachers satisfies this constraint for $N = 50$. In fact, the distribution of teachers as a function
of $Q^*$ has its maximum at $Q^*_{iso}$. When $N \to \infty$, the distribution becomes $\delta(Q^* - Q^*_{iso})$, and $Q^*_{iso}$ tends to the
median, meaning that in this limit, only about 50% of the teachers give rise to hierarchical learning when using the
normalized mapping.
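
The two conditions for hierarchical learning can be evaluated for the simulated system. The sketch below, with illustrative function names, computes $\sigma^4 \Delta$ for both mappings at $N = 50$ and the largest $Q^*$ compatible with $\Delta^* < 1$, which is $1/(1 + \sigma^2)$.

```python
def sigma4_delta(n, a):
    """The factor sigma^4 * Delta that weights the quadratic overlap at small alpha."""
    delta = (n + 1) / 2
    sigma2 = n * a * a / delta
    return sigma2 * sigma2 * delta

def max_q_star(sigma2):
    """Largest Q* with Delta* = sigma2 * Q* / (1 - Q*) below 1."""
    return 1.0 / (1.0 + sigma2)

n = 50
delta = (n + 1) / 2
weight_normalized = sigma4_delta(n, n ** -0.5)     # about 0.039, much smaller than 1
weight_non_normalized = sigma4_delta(n, 1.0)       # about 98, much larger than 1
q_limit_normalized = max_q_star(1.0 / delta)       # equals Delta / (1 + Delta)
```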

In the limit $\alpha \to \infty$, all the generalization error curves converge to the same asymptotic value as the simple
perceptron MMH learning in the feature space, namely $\epsilon_g \simeq 0.500489\,(1 + \Delta)/\alpha$, independently of $\sigma$ and $Q^*$. Thus,
$\epsilon_g$ vanishes slower the larger the inflation factor $\Delta$.
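
To give a sense of scale for this asymptotic law, the following lines evaluate it for the simulated inflation factor and for $\Delta = 0$, which corresponds to a perceptron learning in input space. The function name and the chosen value of $\alpha$ are illustrative.

```python
def asymptotic_error(alpha, delta):
    """Large-alpha generalization error, eps_g ~ 0.500489 * (1 + Delta) / alpha."""
    return 0.500489 * (1.0 + delta) / alpha

eps_svm = asymptotic_error(alpha=1000.0, delta=25.5)       # quadratic SVM with N = 50
eps_perceptron = asymptotic_error(alpha=1000.0, delta=0.0) # no quadratic features
slowdown = eps_svm / eps_perceptron                        # factor 1 + Delta
```

At equal $\alpha$ the asymptotic error is larger by the factor $1 + \Delta$, that is 26.5 for $N = 50$.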

Finally, it is worth to point out that for a — 1, which would correspond to a normalizing factor a = yj A./N, the 
pattern distribution in feature space is isotropic. Irrespective of Q*, the corresponding generalization error is exactly 
the same as that of a simple perceptron learning the MMH with isotropically distributed examples in feature space. 

Since the inflation factor $\Delta$ of the SVM feature space in our approach is a free parameter, it does not diverge in the
thermodynamic limit $N \to \infty$. As a consequence, $\epsilon_g$ does not present any stepwise behaviour, but just a crossover
between a fast decrease at small $\alpha$ and a slower decrease at large $\alpha$. The results of Dietrich et al. [1]
for the normalized mapping, which corresponds to $\sigma^2 \Delta = 1$ in our model, can be deduced by taking the appropriate
limits before solving our saddle point equations. The regime where the number of training patterns $M = \alpha N$
scales with $N$ is straightforward. It is obtained by taking the limit $\sigma \to 0$ and $\Delta \to \infty$, keeping $\sigma^2 \Delta = 1$ in our
equations, with $\alpha$ finite. The regime where the number of training patterns $M = \alpha N$ scales with $N\Delta$, the number of
quadratic features, is obtained by keeping $\tilde{\alpha} = \alpha/(1+\Delta)$ finite whilst taking, here again, the limit $\sigma \to 0$, $\Delta \to \infty$ with
$\sigma^2 \Delta = 1$. The corresponding curves are shown in the figure for the case of an isotropic teacher. In order to allow
comparison with our results at finite $\Delta$, the regime where $\tilde{\alpha}$ is finite is represented as a function of $\alpha = (1+\Delta)\tilde{\alpha}$
using the value of $\Delta$ corresponding to our numerical simulations, namely $\Delta = 25.5$. In the same figure we represent
the generalization error $\epsilon_g = \frac{1}{\pi}\arccos(R)$, where $R$, given by eq. (19), is obtained after solving the saddle point
equations with parameter values $\sigma^2 = 0.039$ and $\Delta = 25.5$.
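The parameter values quoted above can be checked against the normalized-mapping condition $\sigma^2 \Delta = 1$. The arithmetic below uses only numbers stated in the text; the symbol $\tilde{\alpha}$ is a notation adopted here for the rescaled quantity $\alpha/(1+\Delta)$:

```latex
\sigma^2 \Delta = 0.039 \times 25.5 \approx 0.99 \approx 1,
\qquad
\alpha = (1+\Delta)\,\tilde{\alpha} = 26.5\,\tilde{\alpha}
\quad (\Delta = 25.5).
```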

These results, obtained for quadratic SVMs, are easily generalizable to higher order polynomial SVMs. The
corresponding saddle point equations are cumbersome, and will not be given here. We expect a cascade of hierarchical
generalization behaviour, in which successively more and more compressed features are learned. This may be
understood by considering the set of saddle point equations that generalize equation (13). These equations relate the
teacher-student overlaps in the successive subspaces. The sequence of different feature subspaces generalized by the
SVM depends on the relative complexity of the teacher and the student. This is contained in the factors $\Delta^*/\Delta_m$
corresponding to the $m$-th subspace, which appear in the set of equations that generalize eq. (13).



V. CONCLUSION 



We introduced a model that clarifies some aspects of the generalization properties of polynomial Support Vector 
Machines (SVMs) in high dimensional feature spaces. To this end, we focused on quadratic SVMs. The quadratic 
features, which are the pairwise products of input components, may be scaled by a normalizing factor. Depending on 
its value, the generalization error presents very different behaviours in the thermodynamic limit.
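The feature map described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the quadratic features are taken to be all products $x_i x_j$ with $i \le j$ (squares included), a choice that reproduces $\Delta = 25.5$ for $N = 50$, and the scale `sigma` multiplies each product. The function names are hypothetical.

```python
from itertools import combinations_with_replacement


def quadratic_feature_map(x, sigma):
    """Map an input vector to linear features plus scaled pairwise products.

    Assumption: one quadratic feature per pair (i, j) with i <= j, each
    multiplied by the scale `sigma` set by the normalizing factor.
    """
    linear = list(x)
    quadratic = [sigma * xi * xj
                 for xi, xj in combinations_with_replacement(x, 2)]
    return linear + quadratic


def inflation_factor(n_inputs):
    """Ratio between the numbers of quadratic and linear features."""
    n_quadratic = n_inputs * (n_inputs + 1) // 2
    return n_quadratic / n_inputs


if __name__ == "__main__":
    print(quadratic_feature_map([1.0, 2.0], sigma=0.5))
    print(inflation_factor(50))
```

With this counting, `inflation_factor(50)` gives the value $\Delta = 25.5$ used in the simulations. Choosing `sigma` of order $1/\sqrt{N}$ would correspond to the normalized mapping and `sigma` of order 1 to the non normalized one.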

In fact, a finite size SVM may be characterized by two parameters: $\Delta$ and $\sigma$. The inflation factor $\Delta$ is the ratio
between the dimensions of the quadratic and the linear feature subspaces. Thus, it is proportional to the input space dimension $N$.
The variance $\sigma^2$ of the quadratic features is related to the corresponding normalizing factor. Usually, either $\sigma \sim 1/\sqrt{N}$
(normalized mapping) or $\sigma \sim 1$ (non normalized mapping). In previous studies, not only does the input space dimension
diverge in the thermodynamic limit $N \to \infty$, but $\Delta$ and $\sigma$ are also scaled correspondingly.

In our model, neither the proportion of quadratic features $\Delta$ nor their variance $\sigma^2$ is necessarily related to the
input space dimension $N$. They are considered as parameters characterizing the SVMs. Since we keep them constant
when taking the thermodynamic limit, we can study the learning properties of actual SVMs with finite inflation ratios
and normalizing factors, as a function of $\alpha = M/N$, where $M$ is the number of training examples. Our theoretical
results were obtained neglecting the correlations among the quadratic features. The agreement between our computer
experiments with actual SVMs and the theoretical predictions is excellent. The effect of the correlations does not
seem to be important, as there is almost no difference between the theoretical curves and the numerical results.

We find that the generalization error $\epsilon_g$ depends on the type of rule to be inferred through $Q^*$, the (normalized)
sum of the teacher's squared weight components in the quadratic subspace. If $Q^*$ is small enough, the quadratic
components need more patterns to be learned than the linear ones. However, only if the quadratic features are
normalized is $\epsilon_g$ dominated by the fast learning of the linear components at small $\alpha$. Then, on increasing $\alpha$,
there is a crossover to a regime where the decrease of $\epsilon_g$ becomes much slower. The crossover between these two
behaviours is smoother for larger values of $Q^*$, and this effect of hierarchical learning disappears for large enough
$Q^*$. On the other hand, if the features are not normalized, the contributions of both the linear and the quadratic
components to $\epsilon_g$ are of the same order, and there is no hierarchical learning at all.

In the case of the normalized mapping, if the limits $\Delta \sim N \to \infty$ and $\sigma^2 \sim 1/N \to 0$ are taken together with the
thermodynamic limit, the hierarchical learning effect gives rise to the two different regimes, corresponding to $M \sim N$
or $M \sim N^2$, described previously [1].

It is worth pointing out that if the rule to be learned allows for hierarchical learning, the generalization error of
the normalized mapping is much smaller than that of the non normalized one. In fact, the teachers corresponding
to such rules are those with $Q^* < Q^*_{iso}$, where $Q^*_{iso}$ corresponds to the isotropic teacher, the one having all its
weight components equal. For the others, the normalized mapping and the non normalized one perform
similarly. If the weights of the teacher are selected at random on a hypersphere in feature space, the most
probable teachers have precisely $Q^* = Q^*_{iso}$, and the teachers with $Q^* < Q^*_{iso}$ represent of the order of
50% of the inferable rules. Thus, from a practical point of view, without any prior knowledge about the rule
underlying a set of examples, the normalized mapping should be preferred.